Racial profiling in policing has always been a huge point of contention. It is defined as police practice that relies on certain racial characteristics they believe to be associated with crime. This issue is especially prominent in selective policing strategies such as “stop-and-frisk”, i.e. deciding which pedestrains to search, question, or frisk, a program in New York City that has caused numerous civil rights controversies. Police have defended the strategy by claiming that, while stop-rates are higher for non-white citizens, it only reflects reasonable prior knowledge or experience such as historic crime rates by race, or that other counfounding factors such as variation between precincts (neighbourhoods).
Our goal is to find out whether minorities (black and hispanic citizens) are disproportionately targeted even when taking into account to historic crime rates (approximated by arrest rates from the previous year) and variances in precinct as a random effect. To do this, we take the approach of modelling stop rates based on previous year’s arrests, categorized by ethnicity and crime type, while making precinct a random effect in the sampling model. We then examine the coefficients of the posterior distribution to see whether past year’s crimes would affect minority and white group stop rates in a similar way.
A box plot of all 75 precincts showing the variation in stops
Our data source is the stop-and-frisk dataset compiled by Andrew Gelman and Jennifer Hill in his book, and can be found at this site. It contains the times pedestrains are stopped in each of hte 75 precincts in New York City in 2005. We benefitted from his work and hence no further cleaning is necessary. Every row is a precinct’s number of stops for a specific ethnicity and crime type, i.e. one precinct’s total number of stops are spread across several rows. The following variables are in the dataset:
Numerica variables: * stops: the number of stops for a specific ethnicity and crime type, specified by crime and eth variables in the same row * pop: population of a specific ethnicity in the precinct specified by the precinct variable in the same row * past.arrests: the number of arrests in the previous year for the ethnicity and crime type specified by crime and eth variables in the same row
Factor variables: * precinct: numbered from 1-75 * eth: 1=black, 2=hispanic, 3=white * crime: 1=violent, 2=weapons, 3=property, 4=drug
Since our data is count data, it is reasonable to use either the Poisson or Negative Binomial distribution for the sampling model. We realized by running quick summary statistics on response variable stop rate that there is overdispersion, i.e. the variance of 47254 is far greater than the mean of 146. This gives us the idea that Negative Binomial distribution is more appropriate to account for overdispersion factors. We also do not have any zero values, so zero-inflated versions of Poisson or Negative Binomial are not applicable. We validated this by fitting a Poisson and Negative Binomial regression model with precinct as the random effect, and by comparing the AIC/BIC it shows that indeed the Negative Binomial is a much better model fit.
We also want to divide up the precincts by the portion of black population in it, as we want to account for previous police claims that relate to the demographics of a precinct. For example, police claim that they are more likely to stop black pedestrians in the case where the black pedestrian is in a predominantely white neighbourhood, and vice versa. Hence, by categorizing precincts based on the proportion of black population, we can use it to examine if disproportionate stopping of minorities still exists in predominantely white or black neighborhoods. Hence we created the variable pblack which categorizes precincts with <10%, 10-40%, or > 40% black population.
Next, we want to incorporate the hiearchial element into the model, as we have 3 different precinct categories (<10%, 10-40%, or > 40% black population) and 4 different crime types (violent, weapons, property, drug). Previous literature has suggested that police reports having different strategies for different precinct, for example being more aggressive towards stopping pedestrains in high crime precincts, or having higher stop rates because some crime types are easier to spot than others. Hence, it is not ideal to do regression without distinguishing effects of different precincts and crime types. This calls for a hiearchial Negative Binomial program as follows:
\(\lambda\) = mean, \(\alpha\) = over-dispersion factor
\[ \lambda_p= exp(\vec{x^T\beta_p})\\ \begin{eqnarray} Y_{ep}|\lambda_p, \alpha_e & \sim & NB(\lambda_p \alpha_e)\\ \end{eqnarray} \\ Pr(Y_{ep} = y_{ep}|\lambda_p,\alpha_e) = \frac{\Gamma(y_{ep} + \alpha_e^{-1})}{\Gamma(\alpha_e^{-1})\Gamma(y_{ep} + 1)}(\frac{1}{1+ \alpha_e\lambda_p})^{a^{-1}}(\frac{\alpha_e\lambda_p}{1+\alpha_e\lambda_p})^y_{ep} \]
As for the variable past.arrest (last year’s arrest rate), we scale by the factor of 15/12 to account for the fact that we have 15 months of stop data but only 12 months of arrest data.
We have 12 sampling models defined by different combinations of 3 precinct categories and 4 crime types. We assign Normal distribution distribution as the prior for regression coefficient \(\beta\) as it is continuous and takes range from negative to positive infinity. We also assign Gamma distribution as the prior for over-dispersion factor \(\alpha\) as it is continuous and positive.
\[ \begin{eqnarray} \beta_p & \sim & \mathrm{N}(\overline{\beta}, A^{-1})\\ \alpha_e & \sim & Gamma(a,b)\\ \end{eqnarray} \]
We generate 12 posterior distributions on the parameters \(\beta\) and \(\alpha\). We conduct separate comparisons for violent crimes, weapons offenses, property crimes, and drug crimes. For each, we model the number of stops by ethnic group “e” and precinct “p” for that crime type, using as a baseline the DCJS arrest count nep for that ethnic group, precinct, and crime type.
Generate chain of MCMC samples of parameters \(\beta\) and \(\alpha\) from 12 posterior distributions using metropolis algorithm. Since our goal is to predict stop rate for this year in order to compare to last year’s arrest rate, we could simulate the predictive distribution \([Z|data]\) using the posterior distribution \([\beta|data]\). Based on the joint distribution of Z, \(\beta\) and \(\alpha\):
\[ [Z, \beta, \alpha|data] \propto [Z|data, \beta, \alpha][\beta|data]\\ =[Z|\beta, \alpha][\beta|data] \]
The resulting samples will be \((Z_1, \beta_1), ..., (Z_S, \beta_S)\). Ignoring \(\beta\), we generate posterior predive samples of \(Z\). By comparing simulated samples with real data, we realize the posterior distribution tends to shrink towards 0 from positive side, with longer but thinner tails than the real distribution of data. It implies our posterior model is likely to predict more precincts with extreme stop rate, while failing to capture the trend of precincts with moderate stop rate. For most of the time, our model learns the variability of tails sufficiently. It captures the average stop rate for different precinct categories and crime types pretty well, although with comparatively large variance.
The resulting samples will be \((Z_1, \beta_1), ..., (Z_S, \beta_S)\). Ignoring \(\beta\), we generate posterior predictive samples of \(Z\).
Below are the posterior predictive checks for violent crimes in precincts with less than 10% African American population. Our model captures the majority of the actual stops well, but doesn’t have a very accurate grasp on the outliers. The variances of our predicted stops also varies more than we expected, but the actual mean is centered around the middle point of our predicted cases, so we think our model is a good fit overall.
<<<<<<< HEAD
Estimates for the ethnicity and precinct-by-ethnicity-level coefficients for ethnicity, broken down by crime type
Estimates for the ethnicity and precinct-by-ethnicity-level coefficients for ethnicity, broken down by crime type
Estimated rates of different ethnicities for stopped for different categories of crime based on coefficients from hiearchical regression
Black people are stopped more frequently for all crime types except for property crimes across precincts with different proportion of black population, save for property crime and violent crime in majority non-black neighbourhoods. Notably, in precicnts with majority black population (40% and above), the coefficients for the effect of ethnicity on chance of being stopped for violent crimes are negative for hispanics and whites, but not for blacks; i.e. being black uniquely leads to a higher chance of being stopped for violent crimes in majority black neighborhoods. This is an interesting contrast to police statement that they are more likely to do “racial incongruity” stops. i.e. having high chances of minority stops in predominantely white precincts.
for precincts with 10-40% black population, whites are on average more than 3x less likely to be stopped for violent crimes with the same past year arrest rate for that ethnicity, with a average coefficient of -1.29 for blacks and -0.51 whites. Notably, this coefficient is also significant at the 95% credible interval with coefficient (-3.03, -0.840).
for property crime in a precinct with more than 40% black population, whites are always more likely to be targeted than blacks within a 95% credible interval. The coefficient for black is (-1.35, 0.43), and the coefficient for whites is (0.047, 0.68). However, whites are the least likely to be stopped for all other crime types regardless of precinct black population percentage except for property crime.
for precincts with over 40% black population, black people are on average 1.4-1.5x more likely to be stopped from drug crimes than hispanics and whites.
the coefficients of dcjs (the number of arrests last year) are consistent across different crime types and precinct categories, as the 95% credible interval of coefficient is roughly (0.5, 1) all around. i.e. for all races, previous year’s arrest rate for their race makes it more likely for pedestrians of that race to be stopped.
We analyze the stop rates at which New Yorkers of different ethinicy were stopped by the police on the city streets, to assess the central claim that race-specific stop rates reflect nothing more than race-specific crime rates. We aim to assess whether police have reasonable prior knowledge such as historic arrest rates by race to justify their action to stop black more frequently than non-black, or the trend to do “racial congruity stops” for different ethnicities without specifically discriminating against black. From our model, we conclude that historic arrest rates have no impact on stop rates of different racial groups across all crime types as it is approximately consistent. Moreover, whites are the least likely to be stopped for all crime types regardless of precinct balck population, except for property crime which we believe is less likely to be influenced by racial profiling. We conclude that members of minority groups, especially black were stopped more often than whites, both in comparison to their overall population and to the estimated rates of crime that they have committed. ## MCMC convergence and mixing diagnostics
We conducted convergence check for our posterior samples of the coefficients, and it appears that convergence occured at about 100-200 iterations. Hence, we threw out the first 100 draws that are burn-ins and retained 900 valid draws for each of our 12 models. The right panel shows our attempt at thinning the samples because they are not indpendent and identically distributed. However, given we started out with high variance and overdispersion, and we do not really have storage concerns, so we decided against thinning.
Convergence check and thinning
missing predictors, overdispersion, impossible to have every single predictor (i.e. every single factor the police considers when making a stop), that would include factors like age and gender of the pedestrain as well as many circumstantial cues.
controlled for precinct to precinct variation and crime type, but not population and/or demographics. However, there is already plenty literature that demonstrate that minority groups are stopped more oftan even when taking into account their overall population, including nuanced analysis done by taking into account day and mnight population changes to account for commerical and business activity.
Confidence intervals and quantiles for all coefficients, grouped by crime type and precinct black population percentage
From the plots below, other than median income, all of our numeric predictors appear reasonable with no extreme outliers that would necessitate transformation. There are a few observations with median income of around $90,000, which is a large jump from around $70,000 where the next highest observations are. We log-transform median income to correct for the variable’s skewed distribution.
>>>>>>> f4f6ba6ed989bd912ad16a882962ca8d2b22d537We analyze the rates at which New Yorkers of differ- ent ethnic groups were stopped by the police on the city streets, to assess the central claim that race-specific stop rates reflect nothing more than race-specific crime rates.
We address this question in several different ways using data on police stops and conclude that members of minority groups were stopped more often than whites, both in comparison to their overall population and to the estimated rates of crime that they have committed.
When compared in that way, the ratio of stops to DCJS arrests was 1.24 for whites, 1.54 for blacks, and 1.72 for Hispanics; based on this comparison, blacks are stopped 23% more often than whites and Hispanics are stopped 39% more often than whites.
But suppose that the police make more stops in high-crime areas but treat the different ethnic groups equally within any locality. Then the citywide ratios could show significant differences between ethnic groups even if stops were determined entirely by location rather than by ethnicity. Because it is pos- sible that the patterns are systematically different in neigh- borhoods with different ethnic compositions, we divided the precincts into three categories in terms of their black popula- tion: precincts that were less than 10% black, 10–40% black, and more than 40% black.
We conducted separate comparisons for violent crimes, weapons offenses, property crimes, and drug crimes. For each, we modeled the number of stops yep by ethnic group e and precinct p for that crime type, using as a baseline the DCJS arrest count nep for that ethnic group, precinct, and crime type.
For each fit, we simulated three several independent Markov chains from different starting points, stopping when the simulations from each chain alone were as variable as those from all of the chains mixed together (Gelman and Rubin 1992). We then gathered the last half of the simulated chains and used these to compute posterior estimates and(standard errors. For the analyses reported in this article, 10,000(iterations were always sufficient for mixing of the sequences.
>>>>>>> f4f6ba6ed989bd912ad16a882962ca8d2b22d537